Multitype drug interaction prediction based on the deep fusion of drug features and topological relationships

Drug–drug interaction (DDI) prediction has received considerable attention from industry and academia. Most existing methods predict DDIs from drug attributes or relationships with neighbors, which does not guarantee that informative drug embeddings for prediction will be obtained. To address this limitation, we propose a multitype drug interaction prediction method based on the deep fusion of drug features and topological relationships, abbreviated DM-DDI. The proposed method adopts a deep fusion strategy to combine drug features and topologies to learn representative drug embeddings for DDI prediction. Specifically, a deep neural network model is first used on the drug feature matrix to extract feature information, while a graph convolutional network model is employed to capture structural information from the adjacency matrix. Then, we adopt delivery operations that allow the two models to exchange information between layers, as well as an attention mechanism for a weighted fusion of the two learned embeddings before the output layer. Finally, the unified drug embeddings for the downstream task are obtained. We conducted extensive experiments on real-world datasets, the experimental results demonstrated that DM-DDI achieved more accurate prediction results than state-of-the-art baselines. Furthermore, in two tasks that are more similar to real-world scenarios, DM-DDI outperformed other prediction methods for unknown drugs.


Introduction
When multiple drugs are taken together, unexpected drug-drug interactions (DDIs) may occur, which may have either beneficial or detrimental effects on treatment. Beneficial DDIs have a "1+1 > 2" synergistic effect and thus can be exploited as a safe and effective therapeutic strategy for severe diseases, such as cancer and AIDS [1]. In contrast, harmful DDIs may lead to serious adverse drug reactions and even threaten a patient's life. Therefore, accurate identification of potential DDIs is an urgent and practical task. Currently, the main DDI identification methods need long-term clinical trials in vivo and in vitro, which are costly and included in this study, as DM-DDI and other comparative methods are performed on homogeneous drug graphs. Reviewing previous prediction methods, few works have focused on the deep fusion of drug features and topological relations. In these papers, the two types of information are usually combined relatively simply [30,31], as in GCNs, where matrices containing the network structure and feature attributes are constructed and jointly input, and these features are propagated along the network. However, when the number of layers becomes large, oversmoothing problems may be encountered. To address the above deficiencies, we propose a deep fusion strategy to exchange and fuse the information between drug features and topological networks. Specifically, we set up two channels to process drug features and topology separately (capturing drug features with AE and relational networks with GCN), and also designed delivery operations between each layer to exchange information. Before the output layer, we use an attention mechanism to fuse the information from the two channels and obtain the final drug embedding, which is used for drug interaction prediction.
In addition, most traditional methods formulate DDI prediction as a binary classification problem, using "0" or "1" to indicate the presence or absence of a reaction. These methods do not describe whether the reactions are beneficial or harmful and are unable to provide useful guidance for coadministration. Therefore, several methods that can predict DDI-related events were proposed later [32][33][34]. For instance, Deng et al. [34] proposed DDIMDL, in which three features (substructure, target, and enzyme) were calculated and fed into the constructed submodels for separate training. The results of the submodels were combined using a deep neural network. Finally, 65 types of predicted labels were used as outputs, and each type of label corresponded to rich interaction content with chemical and pharmacological descriptions. The specific prediction results promoted the understanding of the underlying mechanisms behind adverse drug reactions. Inspired by this idea, our proposed model also predicts multiple types of DDI events.
Our contributions are summarized as follows: 1. DM-DDI provides a novel deep fusion strategy to fuse drug features and topological relationships, which helps to obtain informative and predictive drug representations. The great performance in comparison with multiple state-of-the-art methods validates the advantages of DM-DDI.
2. We use an attention mechanism to achieve a weighted fusion of drug features and network topology. Moreover, the distribution of attention scores can provide interpretability for the drug prediction process.
3. DM-DDI can predict many types of DDI events and is not limited to binary prediction results. The more detailed DDI prediction results can give us more insight into the coadministration of different drugs.

Dataset
The DDI dataset used in the experiment comes from DDIMDL [34], which contains 37,264 pairwise interactions between 572 drugs characterized by four features: chemical substructure, target, enzyme, and pathway. Based on previous studies [34], we selected the three features with the best effects (chemical substructure, target, and enzyme) for similarity calculation. We introduce how to calculate the similarity of drugs using the chemical structure as an example. PubChem defines 583 types of substructures (other feature descriptors are shown in Table 1), so each drug can be represented by a set of 583-bit feature descriptors with the value "1" or "0" indicating the presence or absence of the corresponding substructure. Finally, we can obtain a feature matrix of the chemical structures with the shape of (572, 583), and the same operation is performed on the target and enzyme features. Since the obtained feature matrices have high dimensionality and contain many "0" values, which may degrade the performance of the model, we conduct the Jaccard similarity calculation for each feature matrix to mitigate the impact, as shown in Eq (1).
Where di and dj represent bit vectors drugs i and j, respectively; |di \ dj| is the intersection of di and dj, and |di \ dj| is the union. After the Jaccard similarity calculation, we can obtain three 572 × 572 feature similarity matrices (X s , X t , X e ), which are concatenated as an initialized feature matrix X. The feature vector of drug i is represented as shown in Eq (2), and the symbol � denotes the concatenation operation.

Overview
The overall framework of DM-DDI is shown in Fig 1. The DDI matrix can be constructed as a network G = {V, E}, where the vertices V denote the drugs involved and the edges E denote the types of drug reactions. The three drug features (chemical substructure, target, and enzyme) can be calculated by Eq (1) to obtain three similarity matrices (X s , X t , X e ), which are concatenated as a unified drug feature matrix X. The model is composed of four main modules: the node feature extraction module uses a DNN model on the feature matrix to capture drug feature information, and the learned feature embedding is represented by H; the  structural relationship extraction module employs the GCN model on the constructed drug network to learn the structural information, and the learned structure embedding is represented by Z; the deep fusion module includes delivery operations between layers and an attention mechanism before the output layer, and the final representation E for all drugs can be obtained after deep fusion; and model optimization module adopts four different combination methods (Average, Hadamard, L1-norm, and Concatenation) to construct drug pair vectors (DPs) and uses the reaction types between the drug pairs as labels. Finally, the DPs and label are input into the cross-entropy loss function for optimization.

Node feature extraction module
To learn feature embeddings from raw data, we use the widely used autoencoder (AE). Because the AE model not only captures the nonlinear relationship between input and output quickly but also effectively reduces dimensionality. The AE model includes an encoder-decoder, where the encoder is acted by the DNN model. We treat the embedding obtained after encoding using Eq (3) as drug features. The details are as follows.
Assuming that the DNN model has l layers and that H (l) denotes the embedding learned at layer l, we formulate H (l) as follows.
where ; represents the ReLU activation function, and W (l) and b (l) are the weight matrix and bias matrix at layer l th , respectively. H (o) denotes the original feature matrix X. The decoder uses the inner product operation for decoding, and we train the model by minimizing the reconstruction loss.

Topology relation extraction module
The AE model can learn important features of each layer, e.g., H (1) , H (2) . . .. . .H (i) , but it ignores the structural relationships between drugs. Therefore, we adopt the GCN model to extract the topological information between drugs. The structural embedding vector of layer l th can be obtained from Eq (4).
whereÃ ¼ A þ I andD ¼ P jÃij :I is the identity diagonal matrix of the adjacent matrix A:Ã is a self-loop matrix, andD is the degree matrix. Z (l−1) is the embedding vector learned at layer (l0 − 1) th , and W (l−1) is a trainable weight matrix that is used to map the information learned from layer (l − 1) th to layer l th .
Considering that the AE model can obtain different levels of feature information, we integrate two embeddings between each layer to obtain a more informative drug representation, as shown in Eq (5).Z Where α is the fusion coefficient, which is used to control the fusion weight of the learned vectors between the GCN model and the AE model. Then,Z ðlÀ 1Þ is fed into the next GCN model to obtain the fused representation Z (l) , as shown in Eq (6).
where ; represents the ReLU activation function. In this way, the feature embedding H (l−1) at layer (l−1) th can be propagated through the normalized adjacency matrix and the same is true for the other layers.

Deep fusion
The key to the model is learning informative and high-quality drug embeddings, which help to promote the model performance. The deep fusion strategy uses delivery operations to realize intralayer and interlayer fusion and utilizes an attention mechanism to complete the final fusion. Eventually, the obtained drug embeddings can simultaneously retain the node features and topological relationships of different layers.

Delivery operation.
(1) Intralayer fusion. Between each layer, the drug feature embedding learned by the AE model is transferred to the GCN model with a certain weight α (refer to Eq (5)). Thus,Z ðlÀ 1Þ can accommodate two different types of information, i.e., drug features and the interaction between drugs.
(2) Interlayer fusion. Generally, the shallow layer (near the input) extracts low-level features, which contain more detailed information but more noise, and the deep layer (near the output) tends to extract high-level features with stronger semantic information but poorer detail perception. Therefore, effectively fusing information from different layers and overcoming the oversmoothing problem that GCN models may suffer is crucial to improving the overall model performance.
To the best of our knowledge, oversmoothing is mainly due to the overemphasis on the relationship with neighbor nodes while ignoring the node features during aggregation [30]. Kipf et al. [30] proposed residual connectivity to transfer the node feature information learned in the upper layer of the model to the next convolutional layer. Since then, a series of improved models have been proposed [35,36]. For example, He et al. [36] proposed a model called LightGCN, which considers the representation of different GCN layers. In other words, this model mitigates the oversmoothing problem by integrating the granularity information of different layers. Inspired by this idea, we design Eqs (5) and (6) to cope with this problem. Specifically, we deliver the node feature information extracted from layer (l − 1) th of the AE model to the GCN model for weight fusion. Then, the learned vector is used as feature vectors in the next iteration of the convolutional network to achieve interlayer fusion. Thus, the oversmoothing problem can be relieved by amplifying the node feature information. Bo et al. [37] adopted similar delivery operations in exchanging feature and topological information and demonstrated that such operations not only integrate structural information but also solve the GCN model's oversmoothing problem.

Attention mechanism.
Considering the different importance of the two learned embeddings for the prediction task, it is necessary to adopt an attention mechanism to assign learnable weights to fuse.
Given Z (l) and H (l , which are obtained from the GCN model and AE model at the last layer l, respectively, the attention mechanism is calculated as shown in Eq (7).
where α Z and α h denote the attention coefficients of embedding Z and H, respectively. The detailed calculation process is as follows. Take node i as an example. Node i in embedding vector Z can be represented as Z i . We first apply a nonlinear transformation and multiply by the shared attention vector q to obtain its attention value w i z , as shown in Eq (8).
where w is the weight matrix, b is a bias vector, and q is a shared attention vector. We can also obtain the attention value w i h of the drug feature embedding. Then, we use the SoftMax function to regularize the attention value and obtain its attention coefficient, as shown in Eq (9). Similarly, For n nodes, the weight coefficients α z = [α z ] and α h = [α h ], which are denoted α z = diag(α z ) and α h = diag(α h ), respectively. Finally, the final drug embedding representations E are obtained as shown in Eq (10).

Model optimization
After deep fusion, a 572 × 65 embedding vector E of all drugs are obtained, where row i and row j represent two drug vectors, denoted Ф(di) and Ф(dj), respectively. We provide four different combinations of methods to construct them into drug pair vectors (DPs), as shown in Table 2.
The symbol J indicates the Hadamard operation, while the symbol L indicates the concatenation operation. Note that the dimensionality of DPs changes only when the concatenation method is selected.
Suppose that the training set is L and that the set of label categories is c. Given a drug pair i, i 2 L, we can calculate the cross-entropy loss by using Eq (11).
The true label y c i ; c 2 C indicates that drug pair i belongs to class c. The predicted labelŷ c i denotes the probability that the predicted result belongs to class c. The prediction result for n drug pairs is denotedŶ ¼ŷ c

Baselines
There are two types of comparison methods: feature-based methods and structure-based methods. For the feature-based methods, only the feature matrix is input. We compare DM-DDI with popular multiple reaction prediction models (DeepDDI, DDIMDL) as well as classical classifiers (LR, RF, and DNN). For structure-based methods, only the adjacency matrix is input. Representative embedding learning methods (LINE, HOPE, Node2Vec, and SDNE), as well as DPDDI and SkipGNN, are selected. Note that for the representative method,

PLOS ONE
we could reimplement BioNEV [15] to save the learned embedding for downstream experiments.
• Cascade drug features We represent drug-drug pairs as feature vectors and use the interaction types as labels. Then, they are fed into the classifier for training and prediction. These methods are traditional supervised learning methods, which we instantiate with logistic regression (LR) [38] and random forest (RF) [39] classifiers.
• DNN Deep neural networks (DNNs) and Lee's [33] ideas are similar. They both directly concatenate three feature similarity matrices and feed them to a DNN classifier. The difference is that the DNN in this paper does not use autoencoders to reduce the dimension but directly combines the feature matrix as the input features instead.
• DeepDDI DeepDDI [32] selects a chemical structure feature matrix, which is dimensioned to 100 by principal component analysis (PCA), as a drug feature. Then, it is fed into a deep learning model for DDI prediction. In this work, we changed the output of the original implementation from the 86th class to the 65th class.
• DDIMDL DDIMDL [34] calculates the similarity of three drug-related features (substructure, target, enzyme). When combined directly, the different features may suffer from interference with each other. Thus three drug-related feature matrices are separately fed into three constructed submodels for training. Then, the DNN model is built for drug reaction prediction.
• HOPE HOPE [40] is a factorization method in which the model uses generalized singular value decomposition to decompose the adjacency matrix to maintain higher-order proximity.

• Node2Vec
The graph embedding learned by node2vec [41] can represent both homogeneous similarity (the more shared neighbors there are, the greater the similarity) and structural similarity (the more similar the role that is played, the greater the similarity). The model adopts a biased random walk strategy to obtain neighbor nodes, and the hyperparameters are set to p = q = 1; therefore, the model can contain two different similarities.
• LINE LINE [42] focuses on learning embeddings based on distributional similarity. First-order nearest neighbors and second-order nearest s are trained separately, and then the two vectors are combined by the inner product as the vector representation.
• SDNE SDNE [43] utilizes deep AE models to maintain first-and second-order network proximity. This model uses highly nonlinear functions to obtain embeddings that retain local and global structural information.
• DPDDI The DPDDI model [26] uses graph convolutional networks (GCNs) to extract low-dimensional structural embeddings from the drug relationship network and then predicts the DDI using deep neural network (DNN) models.
• SkipGNN SkipGNN [25] receives messages from two-hop neighbors and immediate neighbors in the interaction network; thus, it captures higher-order topology information for DDI prediction.

Metric evaluation
To comprehensively evaluate the performance of DM-DDI, we use the accuracy (ACC), area under the precision-recall curve (AUPR), area under the receiver operating characteristic (ROC) curve (AUC), F1, precision (Pre), and recall as the evaluation metrics. We use micro metrics for AUPR and AUC and macro metrics for the other metrics (Pre_macro, F1_macro, Recall_macro). Pre_micro, Recall_micro, and F1_micro scores are the same as ACC scores in the multiclass task. Due to the imbalanced distribution of the DDI dataset, we focus on the AUPR_micro score, which provides a more accurate assessment of the model's performance.

Experimental setup
The DM-DDI model utilizes multi-layer fusion with different numbers of nodes at different layers. We first discuss the effect of the number of layers on the experiment. In this paper, we consider five layers, and the number of nodes per layer is empirically set to {512, 1024, 2000, 256, 65}. We fix the number of neurons in the last hidden layer to 65, as there are 65 types of DDI events. Then, the number of layers is gradually increased to see how the number of layers affects the experimental results. The model learning rate hyperparameter is set to 0.003, and the number of epochs is set to 1,000, as the loss curve does not change when the number of epochs is close to 1,000. We apply fivefold cross-validation and randomly split all DDI pairs into five subsets in our experiments. We train models based on DDIs in the training set and then make predictions for DDIs in the test set. The evaluation score is the average of the output of the five rounds. We select the Adam optimizer to optimize DM-DDI and use the early-stopping strategy to prevent overfitting, which automatically stops the training if no improvement is observed after 20 epochs. The parameters of the other comparative models follow the original paper.

Comparison with state-of-the-art methods.
To make a fair comparison, we uniformly saved the results learned by the model as embedding vectors and then classified them with a DNN classifier. All results are evaluated by 10 runs of validations under the same conditions. The results are shown in Table 3, and the best results are marked in bold. Table 3 shows that the DM-DDI model outperforms the other comparison methods, especially in the F1_macro, Pre_macro, and Recall_macro metrics (the highest F1_macro in the comparison method is 0.780, while the score of F1_macro in DM-DDI model is 0.852). The result demonstrates the great performance of the DM-DDI, which may benefit from incorporating different levels of feature and structural information. An interesting phenomenon is observed: the Node2Vec model works best among the structure-based methods. A possible reason is that it incorporates two types of similarities (homogeneous and structural similarities) and therefore can better represent the topological relationships of neighbors. In contrast, the DPDDI and SDNE models seem to underperform, which we speculate may be caused by data imbalance. Because the imbalanced dataset has a great impact on the GCN convolution effect [44], the encoders of the DPDDI and SDNE models exactly use the GCN model. Note that the DM-DDI model also captures topology information by using the GCN module.

Ablation study.
To explore the effectiveness of each model component, we set the following ablation variants (shown in Table 4) for ablation experiments.
• DM-DDI w/o AE: This variant has no AE module and sets the fusion coefficient alpha = 0.
The DM-DDI model degrades to a multilayer GCN model; thus, the model captures topological information.
• DM-DDI w/o GCN: This variant has no GCN module, and the embedded representation obtained by the AE model is directly used for end-to-end training. Only the attribute feature information of the drugs is used.
• DM-DDI w/o Att: This variant has no attention mechanism. The obtained embeddings after GCN fusion are summed with the embedding learned by AE as the final drug embedding vector. This ablation variant is used to explore the impact of the attention mechanism.
• DM-DDI w/o Delivery: This variant has no delivery operation for cross-fusion. Two embeddings from the GCN and AE modules are input into the attention mechanism to obtain the fused embedding vector. This variant is designed to verify the importance of the delivery operation.
The results from the bar chart in Fig 2 show that DM-DDI exceeds all ablation variants in all metrics, demonstrating the validity of each module of the model. Specifically, the ablation variant DM-DDI w/o AE significantly decreases in numerous metrics compared to the DM-DDI model, indicating that drug features contribute significantly to the prediction of multiple drug reactions.

Multi-class DDI dataset for link prediction.
To verify the speculation in Section 3.4.1, we counted the reaction events for all classes and the results are shown in Fig 3. The

PLOS ONE
results show that the distribution is unbalanced over 65 classes: the number of DDI events decreases significantly as the number of classes increases, with only a few numbers in the final class. To explore more deeply, we remove classes with few labels at the end and extract the first 30 classes of the DDI dataset (termed Class_30), and also select the first 10 classes of the DDI dataset (termed Class_10) with the popular multitype DDI prediction models (DeepDDI, DPDDI, and DDIMDL) for link prediction experiments. The experimental results are shown in Table 5.
As seen from the experimental results in Table 5, when there are fewer classes, all models obtain higher scores on multiple metrics. This demonstrates that the performance of the models is affected by the data balance. The proposed model exhibits optimal performance in all cases, which shows that DM-DDI has a stable network structure. In addition, DPDDI experiences the most significant increase, with the value of F1_macro increasing by 25.6% from 0.491 in the Class_65 dataset to 0.747 in the Class_10 dataset. The observed increase could be attributed to a better convolution effect of the GCN encoder in DPDDI. Moreover, we output the distribution of attention values on different classes as auxiliary information. The results are shown in Fig 4. As the DDI dataset class decreases, the weight of attention values for the GCN embedding continues to increase and even exceeds that of the AE. This suggests that the ability of the GCN model to capture network topology information increases, leading to a higher contribution to the DDI prediction. These two findings further support the experimental  PLOS ONE speculation in Section 3.4.1. According to the above observations, we can infer that the GCN component can work better when the classes become more balanced and DM-DDI might achieve better results.

Multi-task analysis.
Usually, we are more concerned with the ability to predict unknown drugs, so we design tasks A and B that are closer to real cases for the experiment. Task A aims to predict the reaction between known drugs and unknown drugs, and Task B predicts the reaction between two new drugs. The two tasks differ from previous experiments, which predict unobserved reactions among known drugs, while this section predicts reactions with new drugs. Another difference is that the dataset in this section is divided based on drugs rather than drug pairs, and we randomly divide 572 drugs into 5 subsets and take 20% (115 drugs) for testing to simulate drugs without known interactions. In Task A, DM-DDI is performed on the training drugs and tested between the training drugs and testing drugs. In Task B, model training is also performed on the training drugs, whereas the predictions are all taken  The results from the two tasks show that the performance of all models drops obviously when predicting new drugs. However, the proposed model significantly outperforms other competitive models in all metrics. For example, in Fig 5B, the DM-DDI model outperforms the second-best DDIMDL model in six metrics, namely, ACC, AUPR_micro, AUC_micro, F1_macro, Pre_macro, and Recall_macro, by 11.9%, 11.1%, and 3.6%, 7%, 7.6%, and 10.5%, respectively, which demonstrates the effectiveness of the model for new drug predictions. The advantage of prediction may benefit from the mutual reinforcement of drug features and topological information, and DM-DDI can make predictions from drug features even without neighbors.
3.4.5 Sensitivity analysis. In our work, three essential parameters are included: the coefficient α, the number of fusion layers, and the drug pair combination method. The set of α is given as {0.1, 0.3, 0.5, 0.7, 0.9}, the number of fusion layers varies from 1 to 5 layers, and the combination methods include the Average, Hadamard, L1-norm, and Concatenation methods. We conduct experiments to analyze the influences of parameters according to the prediction performances and fix the specific parameter with the best results. The experimental results are given in Fig 6. As shown in Fig 6A, when the number of GCN layers is 3, the model achieves the best results. This outcome may be attributed to the appropriate capture of topological relations and

PLOS ONE
drug feature information. Too few layers cannot capture sufficient information, while too many layers introduce noise and degrade the model performance. Therefore, we fix the number of layers to 3, and the number of hidden layer nodes is set to {2000, 256, 65}. In Fig 6B, we can see that when utilizing the average method, the model performs best overall, which means that the average combination approach can maintain the meanings of drug pairs well. The results in Fig 6C show that the best result is obtained when the fusion coefficient α is set to 0.5, namely, when the GCN embedding is equally fused with the AE embedding.

Case study.
A desirable DDI prediction model should not only pursue good prediction accuracy but also pursue the ability to accurately predict the types of drug interactions [45]. Therefore, this experiment conducts 65 types of DDI event prediction among 572 drugs and 37,264 drug pairs with known interactions. We predict the remaining 289,920 unknown drug reactions to verify the model's ability. We focus on the top 10 classes of DDI events with the highest frequencies and output DDI events with the highest prediction scores. We find evidence in DrugBank [46] for verification, and the experimental results are shown in Table 6. Out of the top ten predicted DDI events with the highest scores, eight reactions are confirmed. For example, the true drug interaction type that occurs between abiraterone and fentanyl is "1", and the predicted probability that falls into type "1" is 0.93. Therefore, the prediction is correct, which means that the metabolism of fentanyl can be decreased when combined with abiraterone.
Furthermore, we counted the top 100 drug pairs with the highest prediction scores, and the results are displayed in Fig 7. The size of the drug node correlates with the number of reactions; the red edge indicates that the interaction is confirmed, and the grey edge indicates that no evidence is found. Sixty-nine percent of the reactions in the graph are validated. The case study results show that the proposed model can predict the unknown DDI events well.

Conclusion
We propose a novel drug interaction prediction model (DM-DDI), which deeply fuses drug features with topological information of neighbors through delivery operation and attention mechanisms. Finally, the information-rich drug embedding vectors are obtained by end-toend training. The experimental results showed that DM-DDI outperformed numerous competitive methods, even on unbalanced 65 class DDI datasets. The prediction accuracy can reach 0.908, and the AUPR can reach 0.964. The case study's prediction results are well- readable and more accurate, which provides strong support for drug interaction prediction. However, our model still has the following limitations: only a single drug-drug relationship was used to capture topological information. In the future, we will collect multitype drugrelated entities (e.g., drugs, proteins, diseases, and side effects) and construct multi-relationship networks to characterize the structural embedding of drugs. In addition, we only utilized basic model components to learn feature representations, such as the GCN model used to capture topological information. Later, we can replace them with more advanced models to better address the imbalance problem.